Fast AI Inference
Cerebras, Groq, fast LLM tokens
Scoured 107 posts in 51.2 ms
🤖 AI · GitHub · 5 days ago
ahwurm/localharness: Model-agnostic agent harness for local LLMs — configure agents in YAML and run them on your own hardware (vLLM, Ollama, LM Studio, llama.cpp).
Covers: uv
Discussed on: Hacker News
🔓 Open Source AI · Anyscale blog posts · 2 days ago
High Performance Distributed Inference with Ray Serve LLM
Covered by: Google Cloud Blog
Discussed on: Hacker News
🔓 Open Source AI · mstar.stanford.edu · 2 days ago
M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models
Discussed on: Hacker News
🏗️ LLM Infrastructure · ByteByteGo Newsletter · 5 days ago
A Guide to AI Inference Engineering
Covers: 6 stories, including Efficient Memory Management for Large Language Model Serving with PagedAttention
Covered by: tldr.tech
🆕 New AI · huggingface.co · 2 days ago
225B-A23B
Covered by: news.smol.ai
Discussed on: r/LocalLLaMA
🔓 Open Source AI · OpenRouter · 5 days ago
Free LLM APIs Compared: Rate Limits, Models, and Real Costs (2026)
Covers: 6 stories, including Ollama
🤖 AI · unsloth.ai · 1 day ago
GLM-5.2 – How to Run Locally
Covers: 2 stories, including "GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inferen..."
Covered by: news.smol.ai
Discussed on: Hacker News
🏗️ LLM Infrastructure · abhishek.it · 2 days ago
Running GLM-5.2 5x faster at 500tps with limitation
Discussed on: Hacker News
🤖 AI · GitHub · 5 hours ago
Second Brain – A free, invisible AI interview copilot (Groq and Llama 3)
Covers: Groq Infrastructure For Inference built for speed, quality, cost and scale
Discussed on: Hacker News
🤖 AI · rocm.blogs.amd.com · 4 days ago
Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization
Discussed on: Hacker News
🔓 Open Source AI · alper.bearblog.dev · 1 day ago
Activate Gemma 4 MTP
🏗️ LLM Infrastructure · Google Cloud Blog · 3 days ago
Scaling Ray Serve LLM on GKE: Performance without losing the developer experience
🏗️ LLM Infrastructure · arxiv.org · 5 days ago
Solyx AI Grid: Hardware-Telemetry-Aware Routing Across Geographically Distributed GPU Clusters
🤖 AI · latent.space · 2 days ago
[AINews] GLM > GPT? GLM-5.2 passes vibe check; Z.ai forecasts Open Fable by December
🧠 Inference Serving · Towards AI · 2 days ago
Continuous Batching: How to Keep Your GPU Actually Busy
🧩 MoE · ServeTheHome · 5 days ago
Tensordyne Napier AI Processor Announced with Logarithmic Math
Discussed on: Hacker News
🤖 AI · GitHub · 10 hours ago
Running a 35B MoE model on a 2017 AMD RX 580 8GB via Vulkan (no ROCm/CUDA)
Discussed on: Hacker News
🔧 Developer tools · spectrum.ieee.org · 5 days ago
Tensordyne's Wild Log Math Aims to Leave Nvidia’s AI Chips In the Dust
Covers: 2 stories, including The AWS Community Builders program is now accepting applications
Discussed on: Hacker News
🧠 Memory Management · thecomputersciencebook.com · 5 days ago
PagedAttention is more than virtual memory
Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on: Hacker News
🧠 LLM Inference · arxiv.org · 2 days ago
UltraQuant: 4-bit KV Caching for Context-Heavy Agents